TB case reports (Solutions to Exercises)

Data sets from http://www.who.int/tb/country/data/download/en/

In [36]:
library(tidyverse)
library(stringr)
library(forcats)

Exercise (Messy to tidy)

We start with a fairly messy version of the data set in tidyr::who.

In [37]:
who  <- tidyr::who
In [38]:
head(who)
countryiso2iso3yearnew_sp_m014new_sp_m1524new_sp_m2534new_sp_m3544new_sp_m4554new_sp_m5564newrel_m4554newrel_m5564newrel_m65newrel_f014newrel_f1524newrel_f2534newrel_f3544newrel_f4554newrel_f5564newrel_f65
AfghanistanAF AFG 1980 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
AfghanistanAF AFG 1981 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
AfghanistanAF AFG 1982 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
AfghanistanAF AFG 1983 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
AfghanistanAF AFG 1984 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
AfghanistanAF AFG 1985 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA

Ex 1*.

  • Find the definition of the first 6 column names by doing a semi_join with dict.
In [40]:
labels  <- data.frame(name = colnames(who)[1:6])
labels
name
country
iso2
iso3
year
new_sp_m014
new_sp_m1524
In [42]:
dict_url <- "https://extranet.who.int/tme/generateCSV.asp?ds=dictionary"
if (!file.exists("dict.csv")) download.file(dict_url, "dict.csv")
dict <- read_csv('dict.csv')
Parsed with column specification:
cols(
  variable_name = col_character(),
  dataset = col_character(),
  code_list = col_character(),
  definition = col_character()
)
In [43]:
semi_join(dict, labels, by=c("variable_name" = "name"))
Warning message:
“Column `variable_name`/`name` joining character vector and factor, coercing into character vector”
variable_namedatasetcode_listdefinition
country Country identification NA Country or territory name
iso2 Country identification NA ISO 2-character country/territory code
iso3 Country identification NA ISO 3-character country/territory code
new_sp_m014 Notification NA New pulmonary smear positive cases: males aged 0-14 years (not used after 2012)
new_sp_m1524 Notification NA New pulmonary smear positive cases: males aged 15-24 years (not used after 2012)

Ex 2.

  • Create a new column called notification and gather all columns beginning wiith new there, such that their values are in a new column called cases. Remove NA values when using gather.
In [44]:
who1  <-  who %>% gather(starts_with('new'),
                         key = 'notification',
                         value = 'cases',
                         na.rm = TRUE)
In [45]:
head(who1)
countryiso2iso3yearnotificationcases
AfghanistanAF AFG 1997 new_sp_m014 0
AfghanistanAF AFG 1998 new_sp_m014 30
AfghanistanAF AFG 1999 new_sp_m014 8
AfghanistanAF AFG 2000 new_sp_m014 52
AfghanistanAF AFG 2001 new_sp_m014129
AfghanistanAF AFG 2002 new_sp_m014 90

Ex 3.

  • Replace all occurrences of the string newrel with new_rel
In [46]:
who2  <- who1 %>% mutate(notification = str_replace(notification, "newrel", "new_rel"))
In [47]:
head(who2)
countryiso2iso3yearnotificationcases
AfghanistanAF AFG 1997 new_sp_m014 0
AfghanistanAF AFG 1998 new_sp_m014 30
AfghanistanAF AFG 1999 new_sp_m014 8
AfghanistanAF AFG 2000 new_sp_m014 52
AfghanistanAF AFG 2001 new_sp_m014129
AfghanistanAF AFG 2002 new_sp_m014 90

Ex 4.

  • Split the notification column into new, type, sex, age. (Note: There are no sex unknown data here)
In [48]:
who3  <- who2 %>% separate(notification, into = c("new", "type", "sexage"))
In [49]:
head(who3)
countryiso2iso3yearnewtypesexagecases
AfghanistanAF AFG 1997 new sp m014 0
AfghanistanAF AFG 1998 new sp m014 30
AfghanistanAF AFG 1999 new sp m014 8
AfghanistanAF AFG 2000 new sp m014 52
AfghanistanAF AFG 2001 new sp m014 129
AfghanistanAF AFG 2002 new sp m014 90
In [50]:
who4  <- who3 %>% separate(sexage, into = c("sex", "age"), sep=1)
In [51]:
head(who4)
countryiso2iso3yearnewtypesexagecases
AfghanistanAF AFG 1997 new sp m 014 0
AfghanistanAF AFG 1998 new sp m 014 30
AfghanistanAF AFG 1999 new sp m 014 8
AfghanistanAF AFG 2000 new sp m 014 52
AfghanistanAF AFG 2001 new sp m 014 129
AfghanistanAF AFG 2002 new sp m 014 90

Ex 5.

  • Drop the iso2, iso3, new columns.
In [52]:
who5  <-  who4 %>% select(-iso2, -iso3, -new)
In [53]:
head(who5)
countryyeartypesexagecases
Afghanistan1997 sp m 014 0
Afghanistan1998 sp m 014 30
Afghanistan1999 sp m 014 8
Afghanistan2000 sp m 014 52
Afghanistan2001 sp m 014 129
Afghanistan2002 sp m 014 90

Exercise (Exploratory data analysis)

Ex 1.

  • For each country compute the total number of cases of TB and save as ex1
In [54]:
ex1 <- who5 %>% group_by(country) %>% summarize(count = sum(cases))
In [55]:
head(ex1)
countrycount
Afghanistan 140225
Albania 5335
Algeria 128119
American Samoa 41
Andorra 103
Angola 308365

Ex 2.

  • For each year, find the country that had the smear positive pulmonary TB. Display in chronological order.
In [56]:
who5 %>% filter(type == 'sp') %>%
group_by(year) %>%
filter(cases == max(cases)) %>%
select(year, country, cases) %>%
arrange(year)
yearcountrycases
1980 Canada 186
1981 Canada 141
1982 Canada 150
1983 Canada 123
1984 Canada 169
1985 Canada 168
1986 Canada 147
1987 Canada 129
1988 Canada 131
1989 Canada 122
1990 Canada 100
1991 Canada 110
1992 Canada 79
1993 Canada 74
1994 Canada 87
1995 China 18306
1996 China 24057
1997 China 28247
1998 China 30093
1999 China 29328
2000 India 31090
2001 India 30007
2002 India 55829
2003 India 63587
2004 India 74450
2005 India 76870
2006 India 82939
2007 India 88045
2008 India 90498
2009 India 90830
2010 India 90440
2011 India 89706
2012 India 88111

3. Express the data for China as a data frame with just 3 columns - year, f and m where the values for f and m are the sum of female and male cases in that year. Show just the 5 years with highest total count of cases.

In [57]:
who5 %>%
filter(country == 'China') %>%
group_by(year, sex) %>%
summarize(count = sum(cases)) %>%
spread(key = sex, value = count) %>%
arrange(desc(f + m)) %>%
head(5)
yearfm
2009 270530613947
2010 268998600094
2011 263933601126
2012 265110593751
2013 261049586127

Ex 4.

  • Make a scatter plot of the number of reports (each row is a report) of TB in each country with the regions on the y axis. Restrict to countries that begin with the letter A. Order the countries by number of reports and remove the y label.
In [58]:
options(repr.plot.width=4, repr.plot.height=3)
In [59]:
foo  <- who5 %>%
filter(str_detect(country, '^A')) %>%
group_by(country) %>%
summarise(n = n())

foo %>%
ggplot(aes(n, reorder(country, n))) +
geom_point() +
ylab(NULL)
Data type cannot be displayed:
_images/CFAR_R_Workshop_2018_Exercisees_33_1.png

Ex 5.

  • Display ex1 as a bar chart. Because of the large number of countries, place the countries on the y-axis, and display countries from fewest to most cases. Scale the counts using a log10 scale, adding 1 to each count so we don’t have to deal with negative infinity.
In [60]:
options(repr.plot.width=6, repr.plot.height=30)
In [61]:
ggplot(ex1) +
geom_bar(aes(x=reorder(country, -count), y=1+count), stat='identity') +
xlab(NULL) +
ylab("Log10(TB cases)") +
scale_y_log10() +
coord_flip()
Data type cannot be displayed:
_images/CFAR_R_Workshop_2018_Exercisees_36_1.png

Ex 6.

    • Make a bar chart of the number of reports (each row is a report) of TB in each country that begin with the letter A. Order the countries by number of reports increasing from left to right. Rotate the country labels so they are readable.
In [62]:
options(repr.plot.width=4, repr.plot.height=3)
In [63]:
who5 %>%
filter(str_detect(country, '^A')) %>%
ggplot() +
geom_bar(aes(x = country %>% fct_infreq %>% fct_rev)) +
theme(axis.text.x = element_text(angle = 90)) +
xlab(NULL)
Data type cannot be displayed:
_images/CFAR_R_Workshop_2018_Exercisees_39_1.png

Ex 7.

  • Can you display as proportion instead of count?
In [64]:
who5 %>%
filter(str_detect(country, '^A')) %>%
ggplot() +
geom_bar(aes(x = country %>% fct_infreq %>% fct_rev,
             y = ..prop..,
             group = 1)) +
theme(axis.text.x = element_text(angle = 90)) +
xlab(NULL)
Data type cannot be displayed:
_images/CFAR_R_Workshop_2018_Exercisees_41_1.png

Ex 8.

  • Add a label showing the first 3 letters of the country label on top of each bar, and remove the x tick and axis labels. Make the bars 50% transparent and fill by prop using a color gradient from yellow to red.
In [65]:
ex8  <- who5 %>%
filter(str_detect(country, '^A')) %>%
mutate(country = country %>% str_sub(1, 3) %>% str_to_upper) %>%
group_by(country) %>%
summarize(n = n()) %>%
mutate(prop = n/sum(n))
In [66]:
ex8
countrynprop
AFG 244 0.056286044
ALB 448 0.103344867
ALG 224 0.051672434
AME 172 0.039677047
AND 387 0.089273356
ANG 425 0.098039216
ANT 346 0.079815456
ARG 448 0.103344867
ARM 461 0.106343714
ARU 37 0.008535179
AUS 826 0.190542099
AZE 317 0.073125721
In [67]:
options(repr.plot.width=7, repr.plot.height=3)
In [69]:
ex8 %>%
ggplot(aes(x = reorder(country, prop),
           y = prop,
           fill  = prop))  +
geom_bar(stat = 'identity', alpha = 0.5) +
geom_text(label = reorder(ex8$country, ex8$prop),
          nudge_y = .02) +
xlab(NULL) +
theme(axis.text.x = element_blank(),
      axis.ticks.x = element_blank()) +
scale_fill_gradient(low = "yellow", high="red") +
guides(fill = FALSE)
Data type cannot be displayed:
_images/CFAR_R_Workshop_2018_Exercisees_46_1.png

Ex 9.

  • Make a bar chart of counts for male and female side by side with countries starting with A on the y axis
  • Set the fill of the bar chart to pink for f and blue for m
  • Add a meaningful title
In [70]:
options(repr.plot.width=6, repr.plot.height=6)
In [71]:
who5 %>%
filter(str_detect(country, '^A')) %>%
group_by(country, sex) %>%
mutate(count = sum(cases)) %>%
ggplot +
geom_bar(aes(x=reorder(country, count),
             y=count, fill=sex),
         position='dodge',
         stat='identity') +
scale_fill_manual(values = c("f" = "pink", "m" = "blue")) +
labs(title="TB cases by region and sex") +
coord_flip()
Data type cannot be displayed:
_images/CFAR_R_Workshop_2018_Exercisees_49_1.png

Ex 10.

  • Make a grid of plots where each column is a different sex), each row is a different country starting with A), and each subplot is a plot of trend over time. Allow the y - scales to vary across plots.Make the fitted curves for each sex have a different color.
In [72]:
options(repr.plot.width=6, repr.plot.height=15)
In [35]:
g  <- who5 %>% filter(str_detect(country, '^A')) %>%
ggplot(aes(x = year, y=cases, color=sex)) +
geom_smooth() +
facet_grid(country ~ sex, scales="free")
suppressWarnings(print(g))
`geom_smooth()` using method = 'loess' and formula 'y ~ x'
_images/CFAR_R_Workshop_2018_Exercisees_52_1.png
In [ ]: